An optimized algorithm for Data Oriented Parsing

Author

  • Khalil Sima'an
Abstract

This paper presents an optimization of a syntactic disambiguation algorithm for Data Oriented Parsing (DOP) (Bod 93) in particular, and for Stochastic Tree-Substitution Grammars (STSGs) in general. The main advantage of this algorithm over existing alternatives ((Bod 93), (Schabes & Waters 93), (Sima'an et al. 94)) is that its time complexity is linear, rather than quadratic, in grammar size (and cubic in sentence length). It is particularly suitable for natural language STSGs, which have many deep elementary-trees and a small underlying Context-Free Grammar (CFG). A first implementation of this algorithm is operational and exhibits a substantial speed-up in comparison to the unoptimized version. In addition to presenting the optimized algorithm, the paper reports experiments measuring the disambiguation accuracy, the expected sizes and the execution times of various DOP models, which are projected from the ATIS domain.

1 Motivation

Many models of natural language performance train presupposed grammars in order to extend them probabilistically (e.g. (Schabes & Waters 93), (Black et al. 93)). In contrast, Data Oriented Parsing (DOP), suggested by Scha (Scha 90) and developed by Bod (Bod 92), projects an STSG directly from a given tree-bank. DOP projects an STSG by decomposing each tree in the tree-bank in all ways, at zero or more internal nodes each time, obtaining a set of constituent structures, which then serves as the elementary-trees set of an STSG. An STSG is basically a Context-Free Grammar (CFG) whose "rules" (or "productions") have internal structure, i.e. are (elementary-)trees. Deriving a parse for a given sentence in an STSG means combining elementary-trees using the same substitution operation as used by CFGs. In contrast to CFGs, however, STSGs allow various derivations to generate the same parse.
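
As an illustration of the decomposition step described above, here is a minimal Python sketch. It is not the paper's implementation. It assumes a hypothetical tree encoding in which a node is a tuple `(label, child, ...)`, a word is a plain string, and a cut node (an open substitution site) is written as the one-element tuple `(label,)`. The function names `fragments_at` and `all_fragments` are invented for this example.

```python
from itertools import product

def fragments_at(node):
    """Return every fragment rooted at `node` (hypothetical encoding:
    node = (label, child, ...), words are plain strings)."""
    label, *children = node
    options = []
    for child in children:
        if isinstance(child, str):
            # Words are always kept as they are.
            options.append([child])
        else:
            # Either cut at this internal node, leaving an open
            # substitution site (label,), or keep one of its fragments.
            options.append([(child[0],)] + fragments_at(child))
    return [(label, *combo) for combo in product(*options)]

def all_fragments(tree):
    """Collect the fragments rooted at every internal node of `tree`."""
    result = fragments_at(tree)
    for child in tree[1:]:
        if not isinstance(child, str):
            result.extend(all_fragments(child))
    return result

# Toy tree, invented for illustration only.
tree = ("S", ("NP", "she"), ("VP", ("V", "sleeps")))
frags = all_fragments(tree)
print(len(frags))
```

Even this four-node toy tree yields ten fragments, which hints at why the elementary-tree set of a DOP model grows much faster than its underlying CFG.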
Crucially for natural language disambiguation, the set of trees generated by combining the elementary-trees of an STSG is not always generable by a CFG; thus, STSGs impose extra constraints on the generated structures. To select a distinguished structure from the space of structures generated for a given sentence, DOP assigns probabilities to the application of elementary-trees in derivations. The probability that DOP infers for each elementary-tree is the ratio between the number of its appearances in the tree-bank (i.e. either as a tree or as a sub-tree) and the total number of appearances of all elementary-trees that share the same root non-terminal (see figure 1). A derivation's probability is then defined as the multiplication of the probabilities of …
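
The relative-frequency estimate described above can be sketched in a few lines of Python. This is an illustrative sketch only: the fragment encoding (tuples whose first element is the root label), the function names and the counts are all hypothetical. It also assumes, since the source sentence is cut off, that a derivation's probability is the product over the elementary-trees used in that derivation.

```python
from collections import Counter
from math import prod

def fragment_probabilities(occurrences):
    """Map each elementary-tree to count(tree) / total count of all
    elementary-trees sharing its root label (frag[0])."""
    counts = Counter(occurrences)
    root_totals = Counter()
    for frag, n in counts.items():
        root_totals[frag[0]] += n
    return {frag: n / root_totals[frag[0]] for frag, n in counts.items()}

def derivation_probability(derivation, probs):
    """Assumption: multiply the probabilities of the elementary-trees
    that make up the derivation."""
    return prod(probs[frag] for frag in derivation)

# Hypothetical occurrence list; the counts are made up for illustration.
s_open = ("S", ("NP",), ("VP",))
s_she = ("S", ("NP", "she"), ("VP",))
np_she = ("NP", "she")
np_he = ("NP", "he")
vp = ("VP", "sleeps")
occurrences = [np_she] * 3 + [np_he] + [s_open] * 2 + [s_she] * 2 + [vp] * 4

probs = fragment_probabilities(occurrences)
p = derivation_probability([s_open, np_she, vp], probs)
print(probs[np_she], p)
```

With these invented counts, `np_she` accounts for 3 of the 4 NP-rooted occurrences, so its probability is 0.75, and the derivation shown has probability 0.5 × 0.75 × 1.0 = 0.375.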


Similar resources

An improved joint model: POS tagging and dependency parsing

Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of natural-language sentences, producing a dependency graph for each input sentence. Part-of-speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging and dependency parsing in a pipeline. Unfortunately, in pipel...


Darwinised Data-Oriented Parsing - Statistical NLP with Added Sex and Death

We present the Darwinised Data-Oriented Parsing algorithm, an incremental, dynamic form of Data-Oriented Parsing, in which exemplars are used as replicators, subject to a selection pressure towards generalisability.


An Optimized PID for Capsubots using Modified Chaotic Genetic Algorithm (RESEARCH NOTE)

This paper proposes a design for a mesoscale capsule robot which can be used in gaining diagnostic data and delivering medical treatment in inaccessible parts of the human body. A novel approach is presented for the capsule robot control: A PID-controlled closed-loop approach. A modified chaotic genetic algorithm will be used to optimize the coefficients of PID controller. Then, simulation will...


Identifying Flow Units Using an Artificial Neural Network Approach Optimized by the Imperialist Competitive Algorithm

The spatial distribution of petrophysical properties within the reservoirs is one of the most important factors in reservoir characterization. Flow units are the continuous body over a specific reservoir volume within which the geological and petrophysical properties are the same. Accordingly, an accurate prediction of flow units is a major task to achieve a reliable petrophysical description o...


AN EXPERIMENTAL INVESTIGATION OF THE SOUNDS OF SILENCE METAHEURISTIC FOR THE MULTI-MODE RESOURCE-CONSTRAINED PROJECT SCHEDULING WITH PRE-OPTIMIZED REPERTOIRE ON THE HARDEST MMLIB+ SET

This paper presents an experimental investigation of the Sounds of Silence (SoS) harmony search metaheuristic for the multi-mode resource-constrained project scheduling problem (MRCPSP) using a pre-optimized starting repertoire. The presented algorithm is based on the time oriented version of the SoS harmony search metaheuristic developed by Csébfalvi et al. [1] for the single-mode resource-con...




Publication date: 1996